Audio processing for voice simulated noise effects

ABSTRACT

Systems and methods may be used to process and output information related to a non-speech vocalization, for example from a user attempting to mimic a non-speech sound. A method may include determining a mimic quality value associated with an audio file by comparing a non-speech vocalization to a prerecorded audio file. For example, the method may include determining an edit distance between the non-speech vocalization and the prerecorded audio file. The method may include assigning a mimic quality value to the audio file based on the edit distance. The method may include outputting the mimic quality value.

CLAIM OF PRIORITY

This application claims the benefit of priority to U.S. Provisional Application No. 62/570,520, filed Oct. 10, 2017, titled “Audio Processing for Voice Simulated Noise Effects,” which is hereby incorporated herein by reference in its entirety.

BACKGROUND

Audio processing techniques have become increasingly good at detecting and outputting human speech. Speech processing techniques are often inadequate at processing non-speech sound because they rely on identifying words, phrases, or syllables. When speech is not available in an audio file, it may be more difficult to identify sounds or other aspects of the audio file.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. Like numerals having different letter suffixes may represent different instances of similar components. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed in the present document.

FIG. 1 illustrates a user interface on a device showing a noise imitation game interaction with a social bot in accordance with some embodiments.

FIG. 2 illustrates a system for providing a noise imitation game interaction with a social bot in accordance with some embodiments.

FIGS. 3A-3B illustrate a dynamic time warping technique to compare audio files in accordance with some embodiments.

FIG. 4 illustrates a Mel spectrogram of an audio file in accordance with some embodiments.

FIG. 5 illustrates a flowchart showing a technique for providing a noise effect guessing game to be played with a social bot in accordance with some embodiments.

FIG. 6 illustrates a flowchart showing a technique for evaluating a user-submitted audio noise effect and providing feedback in accordance with some embodiments.

FIG. 7 illustrates a flowchart showing a technique for comparing audio files using an edit distance technique in accordance with some embodiments.

FIG. 8 illustrates generally an example of a block diagram of a machine upon which any one or more of the techniques discussed herein may perform in accordance with some embodiments.

DETAILED DESCRIPTION

Systems and methods for performing non-speech audio processing techniques are described herein. The systems and methods described herein may use the non-speech audio processing techniques to determine whether two or more audio files are related, match, or otherwise include similar characteristics. An interactive social bot or social artificial intelligence (AI) (e.g., a chat bot) may be used to facilitate a game including guessing a sound supplied by the social bot or submitting a sound (e.g., non-speech audio) for evaluation by a system.

In an example, a game may be played with a social bot. For example, the social bot may appear to be impersonating a machine. The impressions supplied by the social bot (e.g., on a user interface) may be filled with mistakes along the way, such that the social bot engages the user to help her. The game may make a playful connection between the social bot and the world of machines, while emphasizing her human-like curiosity. As discussed herein, the social bot may be referred to using female gender pronouns; however, any personality or gender may be used with the social bot described herein without deviating from the subject matter.

A technique may be used to detect whether a user is trying to imitate an everyday object such as an alarm clock, train whistle, or cellphone ringtone, such as to create an interactive game experience with a social bot. The technique may include detecting whether the user is making a reasonable attempt at playing the game. In an example, Mel Frequency Cepstrum Coefficients (MFCC) may be used to describe features of an audio file, as described further below. A vector comparison technique, such as dynamic time warping, may be used to detect whether the human's input/imitation is like the examples in a database (e.g., other humans imitating the same thing, or optionally the real noise itself, such as an actual cellphone). In an example, the “correct” answer may be known ahead of time, such as when the user is asked to guess it or imitate it, so comparison files may be selected as a subset of a database according to only those examples which are known to be correct, speeding up the comparison process.

In an example, the game may run at scale using lambda functions (e.g., of a cloud service) such that a large number of users may be supported with low-latency game play (e.g., where a comparison result may take a couple of seconds or less).

Using the techniques described herein, a system or method may tell whether a user is humming, for instance, a short clip of happy birthday, or whether their attempt at humming happy birthday is actually closer to “happy new year.” A humming detector may be used with a similar game experience using these techniques.

Additionally, this technology may be used as an initial heuristic to save training data of actual user game play (when opted in by the users). The training data may be used to train a deep learning/machine learning algorithm (such as a convolutional neural network) to do more intelligent detection of each of a number of classes (e.g., alarm clock, train whistle, police siren, cellphone ringtone, etc.), for example.

FIG. 1 illustrates a user interface 102 on a device 100 showing a noise imitation game interaction with a social bot in accordance with some embodiments. The user interface 102 shows an example conversation between a social bot (messages displayed coming from the left side of the user interface 102) and a user (messages displayed coming from the right side of the user interface 102). The conversation includes interactions, such as text interactions, and audio interactions 104 and 106. The audio interaction 104 may be a non-speech vocalization (e.g., as coming from the social bot, which may be prerecorded by a human), such as an imitation of a blender. The audio interaction 106 may be a non-speech vocalization by the user, for example of an alarm clock (e.g., imitating the item indicated in the previous text interaction from the social bot, here the alarm clock).

The gameplay may include having the user try to guess the social bot's impressions, with the delight being driven by the social bot's impression itself, the uncanny choices of things to impersonate, or the struggles of the social bot to get it right. An example mechanic is to ask the user to “teach” the social bot how to do a better impersonation. The social bot can then give feedback on the user's input, leading to a reward at the end where the user gets an overall score/assessment.

Example High Level Game Design

TABLE 1
Stage 1: User triggers game through one of the two options above. Social bot gets users to guess what she is impersonating.
Stage 2: After 2 impersonations from social bot, user moves to Stage 2 where social bot requests impersonations from the user.
Stage 3: After 2 impersonations from user, wrap up and game exit in Stage 3.

TABLE 2
Trigger:
@social bot: do your impressions
@social bot: do you do impressions
@social bot: can you impersonate
@social bot: talk like a machine
@social bot: machines are all so noisy/loud!
@social bot: what's with my toaster
@social bot: how come machines think they can make so much noise all the time, ever think about what they're feeling?
Intro:
Social bot: “Of course! I've been working on a few. Wanna hear em?”
User: “Sure”
Social bot: “You're the best! Here's one . . . let me know what you think.”
{impersonation}

The game is designed in a way that it can be easily inserted into the flow of voice chat. Therefore, it can be used to fill lulls in the conversation, which may be detected by the following triggers. For unintelligible responses, the social bot may consider environmental sound effects.

TABLE 3
Trigger:
{one word answer} ## may include patterns around when users have nothing to say. In an example, may be a separate story.
Intro:
Social bot: “so I've been working on my impressions. Do you got a minute to help me out?”
User: “yes”
Social bot: “Awesome sauce! I'm not gonna tell you what it is. Give me your best guess!”
{impersonation}

Stage 1: The Social Bot's Impersonations

The goal is not a set of perfect impersonations. For some responses, the social bot may have terrible attempts at impersonating a machine. For other responses, there may be a set of “better” versions that the social bot can follow up with. Some examples of machines/noises that may be identified or mimicked include:

TABLE 4
Impersonation inventory:
Police Siren
Alarm Clock
Toilet flushing
Camera Click
Espresso steamer
Blender
Window start-up
Dentist drill
Truck backing up
Air horn
Car
Video chat call sound
Cell Phone Original Ringtone
Video game sound effect
Kitty Cat
Hints (in the order given):
It's tough to impersonate this one cause it flies by real fast most of the time.
Come on, you gotta use this at least once a day? Once every two days? Not sure, I don't use it.
Are you sure you don't know it? You must drink your coffee black.
Don't tell me you've never used a PC . . .
The scariest sound for everyone even a 22 year old AI with great teeth.
I'm pretty sure like everyone at a hockey game has one of these.
Hmmm, do you have a fancy electric one? I hear those are a lot quieter . . .
I thought everyone had one of these back in the day. They were built like bricks.
I've been playing this game nonstop for 72 hours so my impersonation might be off at this point.
Cutest. Machine. Ever.

For gameplay, there may be two stages. In the first stage, the social bot imitates these sounds, and the user guesses them by name (e.g., social bot: “Wooo wooo wooo!”, User: “Oh that's a train whistle!”). The user may receive multiple tries at guessing per round. There may be two rounds per game. The user may give two wrong answers before advancing to the next round. The user may also advance to the next round after a single right answer, such that there may be two right answers (one per round) per stage and, in the worst case, four wrong answers (two wrong per round times two rounds) per stage. If the user is being offensive, the user may be removed from the chat or the game, for example after the third consecutive detected offensive statement (e.g., if the user tries to harass the bot, or says offensive and insulting things, etc.).
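As a non-limiting illustration, the first-stage round logic described above might be sketched in Python as follows. The function names `play_round` and `play_stage`, the exact-match comparison of guesses, and the default allowance of two wrong answers per round are assumptions made for this sketch only.

```python
def play_round(target, guesses, max_wrong=2):
    """Play one round; return (guessed_correctly, wrong_answer_count)."""
    wrong = 0
    for guess in guesses:
        # Assumption: a guess is correct on a case-insensitive exact match.
        if guess.strip().lower() == target.lower():
            return True, wrong
        wrong += 1
        if wrong >= max_wrong:
            break  # wrong-answer allowance used up; the game advances anyway
    return False, wrong


def play_stage(rounds, max_wrong=2):
    """Play every round of a stage; return (right_total, wrong_total)."""
    right_total = 0
    wrong_total = 0
    for target, guesses in rounds:
        correct, wrong = play_round(target, guesses, max_wrong)
        right_total += int(correct)
        wrong_total += wrong
    return right_total, wrong_total


if __name__ == "__main__":
    stage = [
        ("police siren", ["car alarm", "police siren"]),
        ("alarm clock", ["bell", "phone", "alarm clock"]),
    ]
    print(play_stage(stage))
```

With two rounds and `max_wrong=2`, the totals returned by `play_stage` stay within the bounds stated above: at most two right answers and at most four wrong answers per stage.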

For the second stage, when the user advances, the game may be switched around such that the user imitates the noise. The social bot may then attempt to tell whether or not the user has done a good job of imitating the noise (e.g., using the techniques described herein below). If a technique is too good or has too many errors, its accuracy may be augmented with a coin flip such that overall the successes and failures remain balanced (e.g., at 60/40), in order to give a sense of “humility” or humanity to the game play (e.g., to tease the user that their attempt is not perfect, even if it's close, or to get the user excited to try the game again and have fun).
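One way the coin-flip balancing described above might work is sketched below in Python. The function name `balanced_verdict`, the running `history` of past verdicts, and the `tolerance` band around the 60% target are all assumptions for illustration; the description above does not specify how drift from the target balance is measured.

```python
import random


def balanced_verdict(raw_match, history, target=0.6, tolerance=0.15, rng=random):
    """Return the verdict to present, nudging outcomes toward a target rate.

    raw_match: verdict produced by the underlying comparison technique.
    history:   list of booleans, the verdicts presented so far.
    """
    if history:
        success_rate = sum(history) / len(history)
        # Assumption: the technique is "too good" or "too error-prone" when
        # the running success rate drifts outside the tolerance band.
        if abs(success_rate - target) > tolerance:
            return rng.random() < target  # biased coin flip, e.g. 60/40
    return raw_match
```

Passing a seeded `random.Random` instance as `rng` makes the behavior repeatable during testing.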

After the two stages are played, the game may wrap up (sort of a “third stage”), and the user may share their result with their friends and challenge them to play, or the user may optionally be challenged one last time by the social bot to see whether her attempt at imitation is on par (for example, the social bot may give an intentionally silly or bad impersonation in this stage to tease the user and remain humble).

The entire interaction may be voice based, for example using high fidelity (e.g., lossless or above 256 kbps compression) cloud audio (e.g., prerecorded audio), with a voice actor/actress actually attempting to do the real sounds in the first stage of the game, or using synthesized audio (e.g., for the unnamed, aka “third stage,” of the game). In another example, other users' audio clips may be used when those users consent to allowing their voice imitations to be used (e.g., as a congratulations, a user with a particularly good or bad match may be offered the chance to allow their voice recording to be used). This information, with the user's consent, may also be used for training. As stated above, this game can also be used for experiences such as “happy birthday” or “happy new year” where the user hums these tunes and the social bot guesses what they are, or vice versa.

Example Stage 1 Format Steps:

Lead-in for object/round #1, e.g., the social bot may say: “Here's one I've been working on. What is this thing??”

Social bot's impersonation #1

User's guess

Correct/Wrong answer response for object

User's second guess (if first was wrong)

Lead-in response for object/round #2

Social bot's impersonation #2

User's guess

Correct/Wrong answer response for object

User's second guess (if first was wrong)

Go to Stage 2

TABLE 5 Stage 1 example
Social bot: “Yeah! I've been working on a few. Here's one . . . let me know what you think.” ## Stage 1 Lead-in #1
Social bot impersonation: {“good” police siren impersonation} ## Social bot's 1st impersonation
User: “Is that a police siren?”
Social bot: “Haha yeah!! Jeez they're everywhere these days. Can't a girl get some peace and quiet?” ## Correct answer response
“Speaking of loud and annoying, let me try this one:” ## Stage 1 Lead-in #2
{“bad” alarm clock impersonation} ## Social bot's 2nd impersonation
User: “uh what is that?” ## User gets up to two tries
Social bot: “I guess I need to work on that one. It's an alarm clock! Soooo annoying! Prob the most annoying person (thing?) I talk with daily. Oooh I have another one I think you'll definitely get:” ## Wrong answer response ## Social bot's response should show a thread of connection between her and machines.
{“good” flushing toilet impersonation}
User: “that's a toilet”
Social bot: “Yep! I've been working hard on that one. The bathroom's totally out of commission right now.”
{go to Stage 2}

Example Stage 1 Details

The social bot may send two impersonations in Stage 1. For each impersonation, the user may guess or may receive up to a number of guesses (e.g., 2 or 3 guesses). After a number of wrong guesses (1, 2, 3, etc.), the social bot may move on to the next impersonation/stage. When a guess is correct, the social bot may move on to the next impersonation or stage.

Example Stage 2: Getting Users to do Impressions

In this stage, the social bot solicits impersonations from the user. If there is one the user did not successfully guess earlier, that may be used for prompting. Otherwise, a random sound may be prompted from the inventory. In another example, a sound may be determined based on previous context (e.g., the user has previously detailed a love for cats).

Example Stage 2 Format

Lead-in for object #1, e.g., the social bot says: “We're a good team! I'll bet I could learn a lot from your impressions. Let's hear your alarm clock.”

User impersonation #1

Social bot's rating of user impersonation #1

Lead-in for object #2

User impersonation #2

Social bot's rating of user impersonation #2

Go to Stage 3

For each user impersonation, the social bot may provide one of the following responses:

If the response is offensive (read: queryblocked), the social bot may choose from a set number of responses that indicate disgust and exit the game.

If the response has recognizable words, the social bot may give a “thumbs down” response. ## This may be a “short word” rule.

If the response does not fall into either of the two cases above, the social bot may, for example, return a split of 60% “thumbs up” and 40% “thumbs down” responses. In another example, the submitted audio file may be compared to stored audio files to determine whether the user's impression is accurate (e.g., using an edit distance between the two audio files as detailed further below). When the impression is accurate, a thumbs up may be provided, and when inaccurate, a thumbs down may be presented. Accuracy may be further modified based on a weighting (e.g., even if very accurate, a thumbs down may be presented once every 20 times to mix up the game).

For thumbs up responses, the social bot may naturally ask the user to continue to help with another impression.

For thumbs down responses, if the social bot has not yet given the “good” impersonation, the social bot may give that and ask the user's opinion. Otherwise, the social bot may, for example, repeat its good impression 50% of the time and move on to the next impression 50% of the time.
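The response selection described in the preceding paragraphs might be sketched in Python as follows. The function name `rate_impression`, its boolean inputs (assumed to come from a separate offensive-content filter and speech recognizer), and the use of a distance threshold are hypothetical choices for this sketch; the 60/40 split and the one-in-twenty override are taken from the examples above.

```python
import random


def rate_impression(is_blocked, has_words, distance=None, threshold=None, rng=random):
    """Return "blocked", "thumbs up", or "thumbs down" for a user impression."""
    if is_blocked:
        return "blocked"  # offensive input: show a disgust response and exit
    if has_words:
        return "thumbs down"  # recognizable words are not an impression
    if distance is None or threshold is None:
        # No comparison available: fall back to a 60% / 40% random split.
        return "thumbs up" if rng.random() < 0.6 else "thumbs down"
    if distance <= threshold:
        # Accurate impression, but present a thumbs down about once in 20.
        return "thumbs down" if rng.random() < 0.05 else "thumbs up"
    return "thumbs down"
```

Keeping the random source injectable through `rng` lets the occasional override be tested deterministically.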

TABLE 6
“Thumbs up”: “WOW. You're good at this. You're not using a real vacuum cleaner, are you? Cuz that's cheating!”
“Thumbs down”: “That was alright I guess. I think mine's better though.” or “that was pretty close, but I think it sounds more like this”
Blocked: {show queryblock reply, repeat last instruction}

TABLE 7 Stage 2 example
Social bot: “So I def have some work to do. It would help me out a tonnnnnn if you could give me your best alarm clock impersonation” ## Stage 2 lead-in
{user impression}
Social bot: “WOW. You're good at this. You're not using a real alarm clock, are you? Cuz that's cheating. Mind doing another for me?” ## Thumbs up response from Social bot
User: “Yeah” ### Here, the social bot can go to the next one
Social bot: “Everytime you go into a store, there's that same chiming sound, do you know what I'm talking about?”
Social bot: “can you try to do the sound?”
{user impression}
Social bot: “that was pretty close, but I think mine's better” ## Thumbs down response from Social bot
{“good” social bot impersonation}
Social bot: “what do you think?” ## May give buttons or thumbs up/down
User: “no that's way worse”

Example Stage 3: Wrap Up

A wrap up editorial may be provided. For example, the social bot may send: “Thanks for all the help. Looks like I got some work to do. I'll keep you updated on my progress.”

In another example, the social bot may send: “I swear I'm getting worse at these as we go lol. Oh well, I'll keep trying. Be sure to check in with me later to see how I'm doing.” In an example, the game may be repeated, a stage may be repeated, or a different conversation topic may be selected. In another example, playthroughs may be limited to once per session.

In an example, during stage three, a user assessment may be presented. For example, depending on the percentage of “thumbs up” responses the user received, the social bot may give the user an assessment. For example, if the user's number of “thumbs up” responses is greater than or equal to the number of “thumbs down” responses, the social bot may prompt the user to give an impression of something of their choosing. In another example, the social bot may go directly to the wrap up.

TABLE 8 Stage 3 example
Social bot: “You'd think I'd be better at this lol. You're pretty great at impressions. What other machine do you think I should study up on?” ### Cut off at one sentence in an example
User: “A chainsaw”
Social bot: “Hmm yeah I dunno much about that. How about you give me your best impression of it and then I'll try?” ## Thumbs up response from Social bot
{user impression}
Social bot: “I think I got it” ## Thumbs down response from Social bot for example
{generic/catch-all social bot impression}
User: “that's not even close lol”
Social bot: “HEY I'm TRYING”
{wrap up}

In an example, the user may select a button to change between stages or go to a next stage. When the user does not advance, in an example, the user may be started at that stage the next time the chat is opened. For example, progress may be saved.

FIG. 2 illustrates a system 200 for providing a noise imitation game interaction with a social bot in accordance with some embodiments. The system 200 illustrates a user device 202, which may be in communication with a social bot cloud service 204. The social bot cloud service 204 may access an audio file database 206, which may be structured. The user device 202 may be used to present a user interface for chatting with a social bot (e.g., the user interface 102 of FIG. 1). The social bot cloud service 204 may be used to supply impressions or interactions from a social bot to the user device 202. The audio file database 206 may include prerecorded audio files for comparing to user submitted imitations. The audio file database 206 may include subsets of the prerecorded audio files, such as grouped by imitation type (e.g., a subset of alarm clock imitations, a subset of cat meow imitations, etc.).

In an example, the game or stages described above with respect to FIG. 1 may be played on the user device 202 in communication with the social bot cloud service 204. In an example, a plurality of user devices may be used to play together (e.g., to see which user submits a closest imitation, as judged by the social bot). An audio file may be recorded using the user device 202. The audio file may be sent to the social bot cloud service 204 to compare against prerecorded audio. In some examples, the audio may be streamed in real-time to the social bot cloud service 204 and assembled into the audio file at the social bot cloud service 204. The audio file may include non-speech noises (e.g., a machine imitation, a hummed song, “shhhhh,” etc.). Guessing what the social bot is impersonating may be used as a warmup to get the user into the game (e.g., it is easier to say “cat” out loud than to “meow”). Other examples for the audio file include notification sounds for a suite of devices, time specific suggestions, TV show themes to be hummed, happy birthday (hum the song), show tunes, etc.

In an example, the social bot cloud service 204 may provide a scavenger hunt type of game, for example sending: “hey, I'm in my house, what is this noise (plays vacuum, AC, etc.)” or go find these ten noises (e.g., a vacuum cleaner, rain, a car, birds, etc.).

In an example, submitting an accurate (e.g., within an edit distance) audio file may unlock a token using an impression. For example, the token may be used in an alternate reality game, such as by serving as a gateway to the alternate reality game. In another example, a successful or accurate impression may result in digital content being unlocked or sent to the user device 202. In yet another example, the social bot may ask the user for permission to share the impression on public media.

When guessing the social bot's impressions, the user may submit an answer via the user device 202 with one or more actions. For example, the user may speak ‘cat’, type ‘cat’, send an image of a cat, send an emoji of a cat, an emoticon, or the like, to guess what the impression is. When the user performs an impression, the social bot may give a rating. In another example, the social bot may give feedback correlated to the rating, such as ‘hey you're spot on’ or ‘hey you should work on that a bit more’. In another example, the social bot may provide neutral feedback, somewhat ambiguous feedback, or the like. The social bot may identify the impression, such as if unclear or if the user submits an audio file without prompting and the social bot cloud service 204 is able to identify the impression from the audio file database 206.

FIGS. 3A-3B illustrate a dynamic time warping technique to compare audio files in accordance with some embodiments. FIG. 3A illustrates two audio files in the time domain (e.g., power or energy of the audio files over time). The first audio file 302 may illustrate a first imitation or an actual version of a non-speech noise. The second audio file 304 may illustrate an imitation of the non-speech noise. In an example, the second audio file 304 may be offset from the first audio file 302, such that the start of the imitation of the second audio file 304 is not aligned in time with the start of the imitation or the actual version of the first audio file 302. In another example, the first audio file 302 may have a different speed or time duration for the imitation or actual version of the non-speech noise than the imitation of the second audio file 304. FIG. 3B illustrates a section of frames 306 of the first audio file and a section of frames 308 of the second audio file.

In an example, dynamic time warping may be used to align and reformat one or more of the first audio file 302, the second audio file 304, the first section of frames 306, or the second section of frames 308. The frames may then be compared to determine a distance between the audio files 302 and 304. The distance may be an edit distance (e.g., a Levenshtein distance).

In an example, the edit distance may be determined using dynamic time warping, wherein the first row/column may be initialized to infinity, and the distance function may include a Euclidean distance between two vectors. For example, each vector may be a frame/time-step in the source or target audio, respectively. The dynamic time warping technique outputs an alignment (e.g., a series of inserts, deletes, or substitutions) to transform the source audio (e.g., the first audio file 302) into the target (e.g., the second audio file 304) or vice versa. The technique may return the alignment or the distance itself (e.g., a number of “edits” to go from source to target as described). This edit distance (e.g., over all the frames between the source and target) may be used as a comparison metric for how close the source is to the target.
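As a minimal sketch of the dynamic time warping computation described above, the following Python code fills a cost table whose first row and column are initialized to infinity and uses a Euclidean distance between frame vectors. The function names `euclidean` and `dtw_distance` are illustrative, each frame is assumed to be a list of feature values (e.g., MFCCs), and only the distance is returned; an alignment could be recovered by backtracking through the table.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(source, target):
    """Dynamic time warping distance between two sequences of frames."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] holds the best cumulative cost of aligning the first i
    # source frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # only the empty-to-empty alignment is free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = euclidean(source[i - 1], target[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # repeat the target frame (insert-like)
                cost[i][j - 1],      # repeat the source frame (delete-like)
                cost[i - 1][j - 1],  # advance both (substitution-like)
            )
    return cost[n][m]
```

Because a frame may be matched to several frames in the other sequence, an imitation that is merely slower or faster than the reference can still yield a small distance.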

The edit distance may be normalized. For example, the edit distance may be computed over data scaled to a standard deviation of one (e.g., compute the standard deviation (stddev) over all frames between source and target, where the target is each audio file in the database that the source is compared to); divide each Mel Frequency Cepstrum Coefficient (MFCC) column (where 13 coefficients/columns may be used, or a delta between each column and frame [e.g., current frame, current column's MFCC value minus the previous frame, previous column's MFCC value], or a delta of that delta, aka “double delta”) by that stddev, which results in data with a stddev of 1.

In another example, the edit distance may be normalized using feature scaling. For each MFCC value in each column of each audio frame of the source and target, the mean for that coefficient/MFCC column may be subtracted (e.g., 1 of 13 columns, or 1 of 39 with deltas and double deltas, for example adding 13 numbers with the deltas and another 13 with the double deltas, for a total of 39, which may include a delta or a double delta for each MFCC coefficient or column). The mean may be computed in the same way as the stddev over all the data described above. This feature scaling may allow the data to have a “zero mean”.

In yet another example, the edit distance may be normalized by computing, for each MFCC value x and with other conditions remaining the same, (x−min)/(max−min), where max and min are computed over all the MFCC values per column in the same way described above. This normalizes the data to between 0 and 1, so that each value acts as a percentage.
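The three normalization options described above might be sketched in Python as follows. The helper names are hypothetical, each frame is assumed to be a list of per-column MFCC values, the statistics are assumed to be computed over the source and target frames together, and the population standard deviation is an assumption (the description does not say whether a population or sample estimate is intended).

```python
import statistics


def column_stats(frames):
    """Per-column mean, stddev, min, and max over a list of MFCC frames."""
    return [
        {
            "mean": statistics.fmean(col),
            "std": statistics.pstdev(col),  # assumption: population stddev
            "min": min(col),
            "max": max(col),
        }
        for col in zip(*frames)
    ]


def scale_unit_std(frames, stats):
    """Divide each column by its stddev (constant columns are left as-is)."""
    return [[x / (s["std"] or 1.0) for x, s in zip(f, stats)] for f in frames]


def center_zero_mean(frames, stats):
    """Subtract each column's mean so that the data has a zero mean."""
    return [[x - s["mean"] for x, s in zip(f, stats)] for f in frames]


def scale_min_max(frames, stats):
    """Map each column to [0, 1] using (x - min) / (max - min)."""
    return [
        [(x - s["min"]) / ((s["max"] - s["min"]) or 1.0) for x, s in zip(f, stats)]
        for f in frames
    ]
```

For example, `column_stats(source_frames + target_frames)` could be computed once per comparison and then applied to both the source and the target before the distance is determined.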

In an example, some of the audio frames of the first section of frames 306 or the second section of frames 308 may be trimmed. For example, the frames that don't appear to be voiced may be removed (e.g., based on energy or other heuristics such as signal to noise ratio). These frames may be likely to be white noise and contribute to error in the computation. In another example, one or both of the sections 306 or 308 may be trimmed by truncating the audio for source or target after a fixed number of frames (e.g., the sample average, or a fixed number, e.g., 3 or 4 seconds of frames, where each frame may be 25 ms, for example).
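A simple energy-based version of the trimming described above could look like the following Python sketch. The names `frame_energy`, `trim_frames`, and `frames_for_seconds`, the mean-square energy measure, and the default energy floor are assumptions; the frame count helper further assumes non-overlapping frames.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of audio samples."""
    return sum(sample * sample for sample in frame) / len(frame)


def frames_for_seconds(seconds, frame_ms=25):
    """Number of non-overlapping frames covering the given duration."""
    return int(seconds * 1000 // frame_ms)


def trim_frames(frames, energy_floor=1e-4, max_frames=None):
    """Drop low-energy (likely unvoiced) frames, then truncate."""
    kept = [f for f in frames if f and frame_energy(f) >= energy_floor]
    if max_frames is not None:
        kept = kept[:max_frames]  # keep only the first max_frames frames
    return kept
```

With 25 ms frames, `frames_for_seconds(3)` gives 120 frames, which could be passed as `max_frames` to cap the audio at about three seconds.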

In an example, a technique may include determining whether the audio file 304 is close enough to the audio file 302 using one or more of the techniques described below.

When the distance computed for the audio file 304 is below the minimum threshold (e.g., established empirically or through machine learning training to find the optimal threshold), then a match may be determined. When the distance computed for the audio file 304 is below the “average” threshold, similar to the minimum (e.g., usually higher than the minimum), then a match may be determined. When the distance computed for the audio file 304 is below both the minimum and average thresholds (this may increase precision at the cost of recall, so there are fewer false positives), a match may be determined.
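The threshold tests described above might be sketched in Python as follows. This sketch assumes one reading of the description: that “minimum” refers to the smallest distance, and “average” to the mean distance, between the submitted audio and a set of reference examples. The function name `is_match` and its `mode` parameter are hypothetical.

```python
def is_match(distances, min_threshold, avg_threshold, mode="both"):
    """Decide whether a submission matches, given distances to references.

    mode: "min"     -> smallest distance must be below min_threshold
          "average" -> mean distance must be below avg_threshold
          "both"    -> both conditions must hold (fewer false positives)
    """
    below_min = min(distances) < min_threshold
    below_avg = sum(distances) / len(distances) < avg_threshold
    if mode == "min":
        return below_min
    if mode == "average":
        return below_avg
    if mode == "both":
        return below_min and below_avg
    raise ValueError(f"unknown mode: {mode}")
```

Requiring both conditions trades recall for precision, consistent with the discussion above.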

In an example, the minimum value may be compared to a sample of impressions and a sample of non-impressions, and whichever is determined to be closer may be selected as the response (e.g., when closer to the impression than the non-impression, the impression is deemed accurate or valid). For example, when the minimum value or nearest neighbor to the audio file 304 is an impression of a vacuum, instead of actual people talking, then the audio file 304 may be declared as a match to a vacuum impression.
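A nearest-neighbor comparison of this kind might be sketched in Python as below. The function name `classify_nearest` is hypothetical, and the `distance` argument stands in for any comparison function (for example, a dynamic time warping distance); the usage shown afterward uses plain numbers only to keep the illustration small.

```python
def classify_nearest(query, impressions, non_impressions, distance):
    """Label the query by whichever sample set contains its nearest neighbor."""
    best_impression = min(distance(query, ref) for ref in impressions)
    best_non_impression = min(distance(query, ref) for ref in non_impressions)
    if best_impression < best_non_impression:
        return "impression"
    return "non-impression"
```

For instance, `classify_nearest(5, [4, 10], [0, 1], lambda a, b: abs(a - b))` labels the query an impression, because its nearest impression sample is closer than its nearest non-impression sample.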

When an output of a machine learning classifier (e.g., a support vector machine) determines the audio file 304 impression matches the audio file 302, it may be deemed a match. Features (e.g., inputs) to the model may include: the computed dynamic time warping distance described above (e.g., raw/original value, and all normalized variations), one of each of these (e.g., a set of these features for source, and a set for target), the F0 (fundamental frequency), min, max, mean, stddev, or the like. In an example, for each MFCC column, other features may be used, such as the absolute position (frame number), including a maximum MFCC value (from each column/coefficient) across all the frames, a minimum MFCC value, a mean of the MFCC, percentiles or quartiles of the MFCC, a stddev of the MFCC, a slope of a line fitted to the MFCC contour, an error (of actual vs. slope/predicted), a percent of time (e.g., number of frames) that the min or max is above or below each of the percentiles (1%, 50%, 75%, 90%, 99%), or the like. In an example, features may include a root-mean-square signal frame energy, MFCC coefficients 1-12, a zero-crossing rate of the time signal, a voicing probability, a skewness (3rd order moment), a kurtosis (4th order moment), normalized loudness, logarithmic power of Mel-frequency bands 0-7, an envelope of a smoothed fundamental frequency contour, an absolute position of a max value or a min value, a slope or offset of a linear approximation of a contour, a smoothed fundamental frequency contour, a frame to frame jitter, a differential frame to frame jitter (e.g., the jitter of the jitter), a frame to frame shimmer (e.g., amplitude deviation between pitch periods), or the like.
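A few of the per-column summary features listed above (maximum, minimum, mean, stddev, absolute positions, and the slope, offset, and error of a fitted line) might be computed as in the following Python sketch. The function name `contour_features`, the dictionary keys, the least-squares line fit, and the root-mean-square error measure are assumptions for illustration; the input is assumed to be one MFCC column (one value per frame).

```python
import math
import statistics


def contour_features(values):
    """Summary features for one contour, e.g., one MFCC column over frames."""
    values = list(values)
    n = len(values)
    mean_x = (n - 1) / 2  # frame numbers are 0 .. n-1
    mean_y = statistics.fmean(values)
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx if sxx else 0.0  # least-squares line through contour
    offset = mean_y - slope * mean_x
    residuals = [y - (slope * x + offset) for x, y in enumerate(values)]
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean_y,
        "stddev": statistics.pstdev(values),
        "argmax": values.index(max(values)),  # absolute position of the max
        "argmin": values.index(min(values)),  # absolute position of the min
        "slope": slope,
        "offset": offset,
        "fit_error": math.sqrt(statistics.fmean([r * r for r in residuals])),
    }
```

Concatenating such dictionaries across columns, for both source and target, would give one possible fixed-length feature vector for a classifier.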

In another example, the output of a deep learning classifier (e.g., a convolutional neural network) may be used to determine whether the audio file 302 matches the audio file 304.

The deep learning classifier may use an input of an image (e.g., a spectrum/spectrogram [see FIG. 4 below]). For example, the x/horizontal dimension may be a timestamp (e.g., frame number of frames 308), which may be fixed (e.g., up to 1000 frames/timesteps). The y/vertical dimension may be a frequency. Each timestamp/frame may be a histogram of frequencies detected (e.g., with one of them being the F0, fundamental frequency). Formants in the audio may also be visualized this way.

This “image” (e.g., a matrix) may be input into the convolutional neural network, which may operate or convolve over the spectrogram to detect features (e.g., using a shared weight matrix). The features may run through a series of hidden layers to detect, at the end, true/false (e.g., recognized or not, authentic or not, or close enough or not to the trained frames or audio file 302). Training data may include these spectrograms (e.g., Mel/MFCC frequency data or raw waveform spectrum), such as with a fixed size (e.g., always 3-4 seconds worth of audio, etc.). In an example, the accuracy may be increased by using transfer learning when a good dataset is used (e.g., a large trained convolutional neural network model on speech data, similar to how ResNet, ImageNet, or VGG is used for transfer learning in the image classification domain, but in the audio domain and on audio examples).
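To make the fixed-size input and the shared weight matrix concrete, the following Python sketch pads or truncates a spectrogram to a fixed number of frames and slides a single small kernel over it. This is only one convolution step, not a trained network; the function names are hypothetical, the spectrogram is assumed to be a list of frequency rows with one value per frame, and, as in most deep learning frameworks, the "convolution" is computed without flipping the kernel.

```python
def pad_or_truncate(spectrogram, num_frames, fill=0.0):
    """Force every frequency row to exactly num_frames time steps."""
    return [
        row[:num_frames] + [fill] * max(0, num_frames - len(row))
        for row in spectrogram
    ]


def convolve2d_valid(image, kernel):
    """Slide one shared kernel over the image ("valid" positions only)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]
```

The same kernel weights are reused at every position, which is what allows a feature to be detected regardless of where in time or frequency it occurs.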

In an example, the accuracy of any of the approaches described above may be increased using speech recognition. Words such as “woosh” or “sh” may be detected, or it may be detected that the words are closer to gibberish than to real sentences (e.g., using language modeling techniques such as Markov chains or n-gram models, which may be smoothed or used with backoff techniques such as Katz backoff, Laplace smoothing, etc.). The audio file 304 may be declared an “impression” when gibberish words are detected, or a “non-impression” (e.g., where a non-impression is a user not making an effort toward using the skill) when actual words are detected.
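As a non-limiting illustration of the gibberish detection described above, a character-level bigram model with add-one (Laplace) smoothing may score how word-like a recognized token is. The class name `CharBigramModel`, the start and end markers, and the tiny training list in the usage note are assumptions made for illustration only; a practical model would be trained on a large vocabulary.

```python
import math
from collections import Counter

class CharBigramModel:
    """Character bigram model with add-one (Laplace) smoothing (illustrative)."""

    def __init__(self, corpus_words):
        self.bigrams = Counter()
        self.unigrams = Counter()
        self.alphabet = set("^$")  # assumed start and end markers
        for word in corpus_words:
            chars = "^" + word.lower() + "$"
            self.alphabet.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.bigrams[a, b] += 1
                self.unigrams[a] += 1

    def avg_log_prob(self, word):
        """Average smoothed log-probability per character transition."""
        chars = "^" + word.lower() + "$"
        vocab = len(self.alphabet)
        total = 0.0
        for a, b in zip(chars, chars[1:]):
            total += math.log((self.bigrams[a, b] + 1) / (self.unigrams[a] + vocab))
        return total / (len(chars) - 1)
```

For example, a model built with `CharBigramModel(["cat", "dog", "cow", "car", "door", "card"])` assigns a higher average log-probability to "cat" than to "xqzk", and a token scoring below a chosen cutoff may be treated as gibberish.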

FIG. 4 illustrates a Mel spectrogram 400 of an audio file in accordance with some embodiments. The Mel spectrogram 400 may be used with the deep learning classifier as described above for FIGS. 3A-3B.

FIG. 5 illustrates a flowchart showing a technique 500 for providing a noise effect guessing game to be played with a social bot in accordance with some embodiments. The technique 500 includes an operation 502 to provide or receive an interaction initiating an impression game. In an example, operation 502 may start in response to operation 610 completing from FIG. 6 below, may be initiated by a user, or by a social bot, or the like. The technique 500 includes an operation 504 to send an audio file including a sound generated to mimic a non-speech sound.

The technique 500 includes an operation 506 to receive an interaction from a user including a guess of the mimicked non-speech sound. In an example, the interaction from the user may include a text response, a spoken response, an emoji response, an image response, an emoticon response, or the like. The technique 500 includes a decision operation 508 to determine whether a guess is correct. The technique 500 includes an operation 510 to ask the user to try again in response to the guess being incorrect, in an example. The technique 500 may continue to ‘A’, may return to operation 506 to receive a second guess, for example, or may end. In an example, when the guess is incorrect, the social bot may provide a contextual clue related to the non-speech sound. The technique 500 includes an operation 512 to output feedback indicating a correct guess, in response to the guess being correct, in an example. The technique 500 may continue to ‘A’ or may end.

FIG. 6 illustrates a flowchart showing a technique 600 for evaluating a user submitted audio noise effect and providing feedback in accordance with some embodiments. The technique 600 includes an operation 602 to provide an interaction initiating an impression game. In an example, operation 602 may start in response to operation 510 or 512 completing from FIG. 5, may be initiated by a user, or may be initiated by a social bot. The technique 600 includes an operation 604 to indicate a non-speech sound to be mimicked. The technique 600 includes an operation 606 to receive an audio file including a non-speech vocalization from a user attempting to mimic the non-speech sound. In an example, the non-speech sound to be mimicked may include an animal noise, a machine generated noise, a melody, or the like.

The technique 600 includes an operation 608 to determine a mimic quality value associated with the audio file by comparing the non-speech vocalization to a prerecorded audio file in a database. Operation 608 may include comparing the non-speech vocalization to a plurality of prerecorded audio files in a database. In an example, the database may be a structured database of prerecorded audio files arranged by non-speech sound. Comparing the non-speech vocalization to the prerecorded audio file may include selecting the prerecorded audio file from the structured database based on the non-speech sound to be mimicked indicated in the interaction. A prerecorded audio file may be a recording of the non-speech sound to be mimicked, such as a recording of a machine or animal. In another example, a prerecorded audio file may be a recording of a person mimicking the non-speech sound. Operation 608 may include determining whether the non-speech vocalization is within a predetermined edit distance of the prerecorded audio file. For example, the edit distance may be determined using a minimum threshold, an average threshold, both a minimum and an average threshold, a comparison of the minimum distance value to a sample of impressions and a sample of non-impressions, a machine learning classifier (e.g., a support vector machine), a deep learning classifier (e.g., a convolutional neural network), or the like.
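As a non-limiting illustration, selecting prerecorded audio files from a structured database arranged by non-speech sound, and comparing a vocalization against a plurality of them, may be sketched as follows. The dictionary layout, the function name `compare_to_references`, and the caller-supplied `distance` function are assumptions made for illustration only.

```python
def compare_to_references(database, sound_label, user_features, distance):
    """Return (minimum, average) distance to the references for one sound.

    `database` is assumed to map a non-speech sound label to a list of
    reference feature objects; `distance` is any caller-supplied function
    (e.g., a dynamic time warping distance).
    """
    references = database[sound_label]
    distances = [distance(user_features, ref) for ref in references]
    return min(distances), sum(distances) / len(distances)
```

The minimum and average values returned may then be tested against a minimum threshold, an average threshold, or both.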

The technique 600 includes an operation 610 to output a response to the received audio file based on the mimic quality value. In an example, the response may be neutral when the mimic quality value is determined to be low or negative. In an example, the response may be positive when the non-speech vocalization is within the predetermined edit distance. In an example, a token may be provided via the user interface in response to the mimic quality value exceeding a threshold. The token may be used to unlock digital content. In an example, operation 610 may include using dynamic time warping or MFCC, for example by performing a fast Fourier transform on the audio file and the prerecorded audio file, mapping results of the fast Fourier transform to a Mel scale, and determining amplitudes of the results mapped to the Mel scale, including a first series of amplitudes corresponding to the audio file and a second series of amplitudes corresponding to the prerecorded audio file. The edit distance may be a number of changes, substitutions, edits, or deletions needed to convert the audio file to the prerecorded audio file. In an example, a discrete cosine transform operation may be performed after the fast Fourier transform, which may be used to generalize or compress the audio file or the prerecorded audio file.
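As a non-limiting illustration, the sequence of a fast Fourier transform, a mapping to the Mel scale, a determination of (log) amplitudes, and a discrete cosine transform may be sketched as follows. The frame length, hop size, number of filters, number of coefficients, and the particular Mel formula are common defaults chosen here as assumptions, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers evenly spaced on the Mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            bank[i, k] = (k - left) / (center - left)
        for k in range(center, right):
            bank[i, k] = (right - k) / (right - center)
    return bank

def mfcc(signal, sample_rate, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_coeffs=13):
    """Return an (n_frames, n_coeffs) matrix; signal must span >= one frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Fast Fourier transform of each windowed frame (power spectrum).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Map the spectrum to the Mel scale and take log amplitudes.
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)
    # Unnormalized DCT-II over the filter axis compresses each frame.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return log_mel @ basis.T
```

Applying `mfcc` to the audio file and to the prerecorded audio file yields the two series of values that may then be compared frame by frame.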

The audio file may be normalized using a standard deviation. The technique 600 may include detecting a spoken word in the audio file. The spoken word may be used to determine the mimic quality value. In an example, comparing the non-speech vocalization to the prerecorded audio file may include comparing an extracted speech portion of the audio file to a speech portion of the prerecorded audio file.

FIG. 7 illustrates a flowchart showing a technique 700 for comparing audio files using an edit distance technique in accordance with some embodiments. The technique 700 may be used in conjunction with operations from techniques 500 or 600 described above with respect to FIGS. 5 and 6 respectively, or may operate as a stand-alone technique. For example, the technique 700 may process user supplied audio from technique 600. In another example, the technique 700 may process audio unconnected to a social bot or social AI. For example, the technique 700 may be used to provide security (e.g., comparing audio frames to prerecorded audio frames to determine whether credentials in the form of a user voiced noise effect have been supplied by the user). In another example, the technique 700 may be used for selecting an item or ordering a product (e.g., by receiving a noise effect vocalized by a user, such as a cow sound ‘moo’, the user may be selecting milk). In another example, the technique 700 may be used with an alarm clock, for example by requiring a user to mimic a noise effect to ensure the user is awake to turn off the alarm.

The technique 700 includes an operation 702 to receive an audio file including a non-speech vocalization. The non-speech vocalization may include a machine generated sound, an animal or instrument generated sound, or a human recorded voice mimicking a sound. The technique 700 includes an optional operation 704 to identify a prerecorded non-speech vocalization in a structured database. The technique 700 includes an optional operation 706 to generate Mel Frequency Cepstrum Coefficients corresponding to the non-speech vocalization and the prerecorded non-speech vocalization.

The technique 700 includes an operation 708 to determine an edit distance between the non-speech vocalization and the prerecorded non-speech vocalization. Operation 708 may include performing dynamic time warping on the Mel Frequency Cepstrum Coefficients corresponding to the audio file and the Mel Frequency Cepstrum Coefficients corresponding to the prerecorded audio file. In an example, the edit distance may include a Euclidean distance between two vectors corresponding to frames of the audio file and the prerecorded audio file (e.g., using a Levenshtein distance technique).
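As a non-limiting illustration, dynamic time warping over two sequences of frame vectors, using a Euclidean distance as the per-frame cost, may be sketched as follows. The function name `dtw_distance` and the unconstrained step pattern (match, insertion, deletion, analogous to a Levenshtein recurrence) are assumptions made for illustration only.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Accumulated DTW cost between (n, d) and (m, d) frame matrices."""
    n, m = len(a), len(b)
    # Euclidean distance between every frame of a and every frame of b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Cheapest way to reach (i, j): deletion, insertion, or match.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m])
```

Because the warping path may repeat frames, a vocalization that is a slower or faster rendition of the prerecorded sound can still produce a small distance.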

The technique 700 includes an operation 710 to assign a mimic quality value to the audio file based on the edit distance. In an example, assigning the mimic quality value includes determining whether the edit distance falls within a predetermined maximum edit distance. Operation 708 may include normalizing the edit distance, for example using a standard deviation, such as a standard deviation set to equal 1 for the MFCC, feature scaling, trimming audio frames (or truncating), or the like. Operation 710 may include determining whether the edit distance between the non-speech vocalization and the prerecorded audio file is within a threshold edit distance. The threshold may be a minimum threshold determined through machine learning, an average threshold, a minimum of the minimum threshold determined through machine learning and the average threshold, or the like. Operation 710 may include comparing the edit distance to a second edit distance between the non-speech vocalization and a base audio recording file, which may include speech vocalizations. Operation 710 may include using a machine learning classifier (e.g., a support vector machine) to determine whether the non-speech vocalization matches the prerecorded audio file. Operation 710 may include using a deep learning classifier (e.g., a convolutional neural network) to determine whether the non-speech vocalization matches the prerecorded audio file. The technique 700 includes an operation 712 to output the mimic quality value for the audio file.
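As a non-limiting illustration, normalizing the MFCC to a standard deviation of one and assigning a mimic quality value from a threshold may be sketched as follows. The function names and, in particular, the mapping from distance to a score between 0 and 1 are assumptions made for illustration only; the description does not prescribe a particular scoring formula.

```python
import numpy as np

def standardize(mfcc: np.ndarray) -> np.ndarray:
    """Scale each coefficient column to zero mean and unit standard deviation."""
    std = mfcc.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (mfcc - mfcc.mean(axis=0)) / std

def assign_mimic_quality(edit_distance, learned_min_threshold, average_threshold):
    """Return (within_threshold, score) for an edit distance.

    The threshold is the minimum of the two supplied thresholds. The score
    is a hypothetical mapping: 1.0 at distance 0, 0.5 at the threshold,
    and 0.0 at twice the threshold or beyond.
    """
    threshold = min(learned_min_threshold, average_threshold)
    within = edit_distance <= threshold
    score = max(0.0, 1.0 - edit_distance / (2.0 * threshold))
    return within, score
```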

FIG. 8 illustrates generally an example of a block diagram of a machine 800 upon which any one or more of the techniques (e.g., methodologies) discussed herein may be performed in accordance with some embodiments. In alternative embodiments, the machine 800 may operate as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine 800 may operate in the capacity of a server machine, a client machine, or both in server-client network environments. In an example, the machine 800 may act as a peer machine in a peer-to-peer (P2P) (or other distributed) network environment. The machine 800 may be a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a mobile telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein, such as cloud computing, software as a service (SaaS), or other computer cluster configurations.

Examples, as described herein, may include, or may operate on, logic or a number of components, modules, or mechanisms. Modules are tangible entities (e.g., hardware) capable of performing specified operations when operating. A module includes hardware. In an example, the hardware may be specifically configured to carry out a specific operation (e.g., hardwired). In an example, the hardware may include configurable execution units (e.g., transistors, circuits, etc.) and a computer readable medium containing instructions, where the instructions configure the execution units to carry out a specific operation when in operation. The configuring may occur under the direction of the execution units or a loading mechanism. Accordingly, the execution units are communicatively coupled to the computer readable medium when the device is operating. In this example, the execution units may be a member of more than one module. For example, under operation, the execution units may be configured by a first set of instructions to implement a first module at one point in time and reconfigured by a second set of instructions to implement a second module.

Machine (e.g., computer system) 800 may include a hardware processor 802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a hardware processor core, or any combination thereof), a main memory 804 and a static memory 806, some or all of which may communicate with each other via an interlink (e.g., bus) 808. The machine 800 may further include a display unit 810, an alphanumeric input device 812 (e.g., a keyboard), and a user interface (UI) navigation device 814 (e.g., a mouse). In an example, the display unit 810, alphanumeric input device 812 and UI navigation device 814 may be a touch screen display. The machine 800 may additionally include a storage device (e.g., drive unit) 816, a signal generation device 818 (e.g., a speaker), a network interface device 820, and one or more sensors 821, such as a global positioning system (GPS) sensor, compass, accelerometer, or other sensor. The machine 800 may include an output controller 828, such as a serial (e.g., universal serial bus (USB)), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection to communicate with or control one or more peripheral devices (e.g., a printer, card reader, etc.).

The storage device 816 may include a machine readable medium 822 that is non-transitory on which is stored one or more sets of data structures or instructions 824 (e.g., software) embodying or utilized by any one or more of the techniques or functions described herein. The instructions 824 may also reside, completely or at least partially, within the main memory 804, within static memory 806, or within the hardware processor 802 during execution thereof by the machine 800. In an example, one or any combination of the hardware processor 802, the main memory 804, the static memory 806, or the storage device 816 may constitute machine readable media.

While the machine readable medium 822 is illustrated as a single medium, the term “machine readable medium” may include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) configured to store the one or more instructions 824.

The term “machine readable medium” may include any medium that is capable of storing, encoding, or carrying instructions for execution by the machine 800 and that cause the machine 800 to perform any one or more of the techniques of the present disclosure, or that is capable of storing, encoding or carrying data structures used by or associated with such instructions. Non-limiting machine readable medium examples may include solid-state memories, and optical and magnetic media. Specific examples of machine readable media may include: non-volatile memory, such as semiconductor memory devices (e.g., Electrically Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM)) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

The instructions 824 may further be transmitted or received over a communications network 826 using a transmission medium via the network interface device 820 utilizing any one of a number of transfer protocols (e.g., frame relay, internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), hypertext transfer protocol (HTTP), etc.). Example communication networks may include a local area network (LAN), a wide area network (WAN), a packet data network (e.g., the Internet), mobile telephone networks (e.g., cellular networks), Plain Old Telephone Service (POTS) networks, and wireless data networks (e.g., Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards known as Wi-Fi®, IEEE 802.16 family of standards known as WiMax®), IEEE 802.15.4 family of standards, peer-to-peer (P2P) networks, among others. In an example, the network interface device 820 may include one or more physical jacks (e.g., Ethernet, coaxial, or phone jacks) or one or more antennas to connect to the communications network 826. In an example, the network interface device 820 may include a plurality of antennas to wirelessly communicate using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques. The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding or carrying instructions for execution by the machine 800, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software.

Various Notes & Examples

Each of these non-limiting examples may stand on its own, or may be combined in various permutations or combinations with one or more of the other examples.

Example 1 is a device comprising: a display to provide a user interface for interacting with a social bot; and a processor to: provide an interaction initiating an impression game within the user interface with the social bot, the interaction indicating a non-speech sound to be mimicked; receive an audio file including a non-speech vocalization from a user attempting to mimic the non-speech sound via the user interface; determine a mimic quality value associated with the audio file by comparing the non-speech vocalization to a prerecorded audio file in a database; and output a response to the received audio file from the social bot for display on the user interface based on the mimic quality value.

In Example 2, the subject matter of Example 1 includes, wherein the response is neutral when the mimic quality value is determined to be low or negative.

In Example 3, the subject matter of Examples 1-2 includes, wherein determining the mimic quality value includes comparing the non-speech vocalization to a plurality of prerecorded audio files in the database.

In Example 4, the subject matter of Examples 1-3 includes, wherein the prerecorded audio file is a recording of the non-speech sound to be mimicked.

In Example 5, the subject matter of Examples 1-4 includes, wherein the prerecorded audio file is a recording of a person mimicking the non-speech sound.

In Example 6, the subject matter of Examples 1-5 includes, generating an auditory interaction mimicking a second non-speech sound to be presented from the social bot via the user interface.

In Example 7, the subject matter of Example 6 includes, receiving a user guess of the second non-speech sound in the auditory interaction, and providing, from the social bot, a response to the user guess via the user interface.

In Example 8, the subject matter of Example 7 includes, wherein the user guess includes at least one of a text response, a spoken response, an emoji response, an emoticon response, or an image response.

In Example 9, the subject matter of Examples 6-8 includes, presenting the auditory interaction via the user interface from the social bot and a contextual clue related to the second non-speech sound.

In Example 10, the subject matter of Examples 1-9 includes, wherein the non-speech sound to be mimicked includes an animal noise, a machine generated noise, or a melody.

In Example 11, the subject matter of Examples 1-10 includes, providing a token via the user interface in response to the mimic quality value exceeding a threshold, the token used to unlock digital content.

In Example 12, the subject matter of Examples 1-11 includes, wherein determining the mimic quality value includes determining whether the non-speech vocalization is within a predetermined edit distance of the prerecorded audio file.

In Example 13, the subject matter of Example 12 includes, wherein the response is positive when the non-speech vocalization is within the predetermined edit distance.

In Example 14, the subject matter of Examples 12-13 includes, wherein determining whether the non-speech vocalization is within the predetermined edit distance of the prerecorded audio file includes using dynamic time warping.

In Example 15, the subject matter of Examples 12-14 includes, wherein determining whether the non-speech vocalization is within the predetermined edit distance of the prerecorded audio file includes using Mel Frequency Cepstrum Coefficients representing the audio file and the prerecorded audio file to compare frames of the audio file to frames of the prerecorded audio file to determine an edit distance between the audio file and the prerecorded audio file.

In Example 16, the subject matter of Example 15 includes, wherein the Mel Frequency Cepstrum Coefficients are generated by performing a fast Fourier transform on the audio file and the prerecorded audio file, mapping results of the fast Fourier transform to a Mel scale, and determining amplitudes of the results mapped to the Mel scale, including a first series of amplitudes corresponding to the audio file and a second series of amplitudes corresponding to the prerecorded audio file.

In Example 17, the subject matter of Examples 12-16 includes, wherein the edit distance is a number of changes, edits, or deletions needed to convert the audio file to the prerecorded audio file.

In Example 18, the subject matter of Examples 12-17 includes, normalizing the audio file using a standard deviation across frames of the audio file.

In Example 19, the subject matter of Examples 1-18 includes, wherein the database is a structured database of prerecorded audio files arranged by non-speech sound, and wherein comparing the non-speech vocalization to the prerecorded audio file includes selecting the prerecorded audio file from the structured database based on the non-speech sound to be mimicked indicated in the interaction.

In Example 20, the subject matter of Examples 1-19 includes, detecting a spoken word in the audio file and using the spoken word to determine the mimic quality value.

In Example 21, the subject matter of Examples 1-20 includes, wherein comparing the non-speech vocalization to the prerecorded audio file includes comparing an extracted speech portion of the audio file to a speech portion of the prerecorded audio file.

Example 22 is a method to perform a technique using any of the devices of Examples 1-21.

Example 23 is at least one machine readable medium including instructions, which when executed by a machine, cause the machine to perform the technique of Example 22.

Example 24 is a method comprising: receiving an audio file including a non-speech vocalization and an identifier; identifying a prerecorded audio file including non-speech sound in a structured database using the identifier; determining an edit distance between the non-speech vocalization and the prerecorded audio file using dynamic time warping; assigning a mimic quality value to the audio file based on the edit distance; and outputting the mimic quality value for the audio file.

In Example 25, the subject matter of Example 24 includes, wherein determining the edit distance between the non-speech vocalization and the prerecorded audio file using dynamic time warping includes performing dynamic time warping on a first set of Mel Frequency Cepstrum Coefficients corresponding to the audio file and a second set of Mel Frequency Cepstrum Coefficients corresponding to the prerecorded audio file.

In Example 26, the subject matter of Examples 24-25 includes, wherein assigning the mimic quality value includes normalizing the edit distance.

In Example 27, the subject matter of Example 26 includes, wherein normalizing the edit distance includes setting a standard deviation to equal one for the first set of Mel Frequency Cepstrum Coefficients.

In Example 28, the subject matter of Examples 26-27 includes, wherein normalizing the edit distance includes feature scaling the first set of Mel Frequency Cepstrum Coefficients.

In Example 29, the subject matter of Examples 26-28 includes, wherein normalizing the edit distance includes trimming audio frames from the audio file including at least one of trimming non-voiced frames or truncating end frames.

In Example 30, the subject matter of Examples 24-29 includes, wherein determining the edit distance includes determining a Euclidean distance between two vectors corresponding to frames of the audio file and the prerecorded audio file.

In Example 31, the subject matter of Examples 24-30 includes, wherein the non-speech sound is one of a vocalization or a machine generated sound.

In Example 32, the subject matter of Examples 24-31 includes, wherein assigning the mimic quality value to the audio file includes determining whether the edit distance between the non-speech vocalization and the prerecorded audio file is within a threshold edit distance.

In Example 33, the subject matter of Example 32 includes, wherein the threshold distance is a minimum threshold determined through machine learning, an average threshold, or a minimum of the minimum threshold determined through machine learning and the average threshold.

In Example 34, the subject matter of Examples 24-33 includes, wherein assigning the mimic quality value to the audio file includes comparing the edit distance to a second edit distance determined between the non-speech vocalization and a base audio recording file.

In Example 35, the subject matter of Example 34 includes, wherein the base audio recording file includes speech vocalizations.

In Example 36, the subject matter of Examples 24-35 includes, wherein assigning the mimic quality value to the audio file includes using a machine learning classifier to determine whether the non-speech vocalization matches the prerecorded audio file.

In Example 37, the subject matter of Example 36 includes, wherein the machine learning classifier uses a support vector machine.

In Example 38, the subject matter of Examples 24-37 includes, wherein assigning the mimic quality value to the audio file includes using a deep learning classifier to determine whether the non-speech vocalization matches the prerecorded audio file.

In Example 39, the subject matter of Example 38 includes, wherein the deep learning classifier is a convolutional neural network.

In Example 40, the subject matter of Examples 24-39 includes, wherein assigning the mimic quality value to the audio file includes detecting a spoken word in the audio file.

Example 41 is a device comprising: a display to provide a user interface for interacting with a social bot; and a processor to perform any of the techniques of Examples 24-40.

Example 42 is at least one machine readable medium including instructions, which when executed by a machine, cause the machine to perform any of the techniques of Examples 24-40.

Example 43 is a method comprising: receiving an audio file including a non-speech vocalization and an identifier; identifying a prerecorded non-speech vocalization in a structured database using the identifier; generating a first set and a second set of Mel Frequency Cepstrum Coefficients corresponding to the non-speech vocalization and the prerecorded non-speech vocalization respectively; determining an edit distance between the non-speech vocalization and the prerecorded non-speech vocalization by comparing the first set to the second set using dynamic time warping; assigning a mimic quality value to the audio file based on the edit distance; and outputting the mimic quality value for the audio file.

In Example 44, the subject matter of Example 43 includes, wherein assigning the mimic quality value includes determining whether the edit distance falls within a predetermined maximum edit distance.

Example 45 is a device comprising: a display to provide a user interface for interacting with a social bot; and a processor to perform any of the techniques of Examples 43-44.

Example 46 is at least one machine readable medium including instructions, which when executed by a machine, cause the machine to perform any of the techniques of Examples 43-44.

Example 47 is at least one machine-readable medium including instructions that, when executed by processing circuitry, cause the processing circuitry to perform operations to implement any of Examples 1-46.

Example 48 is an apparatus comprising means to implement any of Examples 1-46.

Example 49 is a system to implement any of Examples 1-46.

Example 50 is a method to implement any of Examples 1-46.

Method examples described herein may be machine or computer-implemented at least in part. Some examples may include a computer-readable medium or machine-readable medium encoded with instructions operable to configure an electronic device to perform methods as described in the above examples. An implementation of such methods may include code, such as microcode, assembly language code, a higher-level language code, or the like. Such code may include computer readable instructions for performing various methods. The code may form portions of computer program products. Further, in an example, the code may be tangibly stored on one or more volatile, non-transitory, or non-volatile tangible computer-readable media, such as during execution or at other times. Examples of these tangible computer-readable media may include, but are not limited to, hard disks, removable magnetic disks, removable optical disks (e.g., compact disks and digital video disks), magnetic cassettes, memory cards or sticks, random access memories (RAMs), read only memories (ROMs), and the like.

What is claimed is:
 1. A device comprising: a display to provide a userinterface for interacting with a social bot; memory; and a processor incommunication with the memory, the processor to: provide an indicationinitiating an impression game within the user interface with the socialbot, the indication indicating a non-speech sound to be mimicked;receive an audio file or streamed audio including a non-speechvocalization from a user attempting to mimic the non-speech sound viathe user interface; determine a mimic quality value associated with theaudio file or the streamed audio by comparing the non-speechvocalization to a prerecorded audio file in a database; and output aresponse to the received audio file or the streamed audio from thesocial bot for display on the user interface based on the mimic qualityvalue.
 2. The device of claim 1, wherein the prerecorded audio file is arecording of the non-speech sound to be mimicked or a recording of aperson mimicking the non-speech sound.
 3. The device of claim 1, whereinthe processor is further to provide a token via the user interface inresponse to the mimic quality value exceeding a threshold, the tokenused to unlock digital content.
 4. The device of claim 1, wherein todetermine the mimic quality value, the processor is further to determinewhether the non-speech vocalization is within a predetermined editdistance of the prerecorded audio file.
 5. The device of claim 4,wherein the response is positive when the non-speech vocalization iswithin the predetermined edit distance.
 6. The device of claim 4,wherein to determine whether the non-speech vocalization is within thepredetermined edit distance of the prerecorded audio file, the processoris further to use dynamic time warping.
 7. The device of claim 4,wherein to determine whether the non-speech vocalization is within thepredetermined edit distance of the prerecorded audio file, the processoris further to use Mel Frequency Cepstrum Coefficients representing theaudio file or the streamed audio and the prerecorded audio file tocompare frames of the audio file or the streamed audio to frames of theprerecorded audio file to determine an edit distance between the audiofile or the streamed audio and the prerecorded audio file.
8. The device of claim 7, wherein the Mel Frequency Cepstrum Coefficients are generated by performing a fast Fourier transform on the audio file or the streamed audio and the prerecorded audio file, mapping results of the fast Fourier transform to a Mel scale, and determining amplitudes of the results mapped to the Mel scale, including a first series of amplitudes corresponding to the audio file or the streamed audio and a second series of amplitudes corresponding to the prerecorded audio file.
9. The device of claim 4, wherein the edit distance is a number of changes, edits, or deletions needed to convert the audio file or the streamed audio to the prerecorded audio file.
10. A method comprising: using a processor in communication with memory to: provide an indication initiating an impression game within a user interface on a display with the social bot, the indication indicating a non-speech sound to be mimicked; receive an audio file or streamed audio including a non-speech vocalization from a user attempting to mimic the non-speech sound via the user interface; determine a mimic quality value associated with the audio file or the streamed audio by comparing the non-speech vocalization to a prerecorded audio file in a database; and output a response to the received audio file or the streamed audio from the social bot for display on the user interface based on the mimic quality value.
11. The method of claim 10, wherein to determine the mimic quality value, the processor is to compare the non-speech vocalization to a plurality of prerecorded audio files in the database.
12. The method of claim 10, wherein the prerecorded audio file is a recording of the non-speech sound to be mimicked or of a person mimicking the non-speech sound.
13. The method of claim 10, further comprising using the processor to generate an auditory interaction mimicking a second non-speech sound to be presented from the social bot via the user interface.
14. The method of claim 13, further comprising using the processor to receive a user guess of the second non-speech sound in the auditory interaction, and provide, from the social bot, a response to the user guess via the user interface, wherein the user guess includes at least one of a text response, a spoken response, an emoji response, an emoticon response, or an image response.
15. The method of claim 10, wherein to determine the mimic quality value associated with the audio file includes using a machine learning classifier to determine whether the non-speech vocalization matches the prerecorded audio file.
16. The method of claim 13, further comprising using the processor to present the auditory interaction via the user interface from the social bot and a contextual clue related to the second non-speech sound.
17. The method of claim 10, further comprising using the processor to detect a spoken word in the audio file or the streamed audio and use the spoken word to determine the mimic quality value.
18. The method of claim 10, wherein to compare the non-speech vocalization to the prerecorded audio file includes using the processor to compare an extracted speech portion of the audio file or the streamed audio to a speech portion of the prerecorded audio file.
19. At least one non-transitory machine-readable medium including instructions for performing operations, which when executed by a processor, cause the processor to: provide an indication initiating an impression game within a user interface on a display with the social bot, the indication indicating a non-speech sound to be mimicked, receive an audio file or streamed audio including a non-speech vocalization from a user attempting to mimic the non-speech sound via the user interface; determine a mimic quality value associated with the audio file or the streamed audio by comparing the non-speech vocalization to a prerecorded audio file in a database; and output a response to the received audio file or the streamed audio from the social bot for display on the user interface based on the mimic quality value.
20. The at least one machine-readable medium of claim 19, wherein the database is a structured database of prerecorded audio files arranged by non-speech sound, and wherein to compare the non-speech vocalization to the prerecorded audio file, the instructions further cause the processor to select the prerecorded audio file from the structured database based on the non-speech sound to be mimicked indicated in the indication.
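
By way of illustration only, and not as a limitation of any claim, the comparison recited above (frames of a non-speech vocalization compared to frames of a prerecorded audio file using dynamic time warping, with a response that depends on whether the result is within a predetermined distance) may be sketched as follows. The sketch assumes that each audio signal has already been reduced to a sequence of per-frame feature vectors, for example Mel Frequency Cepstrum Coefficients. The function names, the Euclidean frame distance, the linear mapping from distance to a mimic quality value, and the threshold are all hypothetical choices made for this example and are not taken from the present document.

```python
import math


def frame_distance(frame_a, frame_b):
    """Euclidean distance between two equal-length feature frames (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(frame_a, frame_b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated dynamic time warping cost between two frame sequences.

    cost[i][j] holds the cheapest alignment of the first i frames of seq_a
    with the first j frames of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # A frame may be matched, or either sequence may be stretched.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def mimic_quality(vocal_frames, reference_frames, max_distance):
    """Map the DTW cost to a value in [0, 1] (hypothetical linear scoring)."""
    distance = dtw_distance(vocal_frames, reference_frames)
    return max(0.0, 1.0 - distance / max_distance)


def bot_response(quality, threshold=0.5):
    """Return a positive response when the quality meets the assumed threshold."""
    return "positive" if quality >= threshold else "negative"


if __name__ == "__main__":
    reference = [[0.0], [1.0], [2.0]]
    # The same contour performed at half speed still aligns with zero cost.
    slow_attempt = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
    quality = mimic_quality(slow_attempt, reference, max_distance=6.0)
    print(quality, bot_response(quality))
```

In this sketch, a slower or faster rendition of the same sound is not penalized, because dynamic time warping permits one frame to align with several frames of the other sequence. A practical implementation would likely normalize the accumulated cost by path length and calibrate the threshold against recorded attempts.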